Chinese document layout analysis using an adaptive regrouping strategy
Abstract
In document layout analysis, the defining conditions for textlines and text regions involve certain numerical parameters (e.g. inter-character spacing and inter-textline spacing) whose values can only be estimated when textlines and text regions have already been formed. This seemingly chicken-and-egg problem can be solved through an adaptive regrouping strategy, which consists of three operations. First, we group basic ingredients into preliminary textlines and text regions according to crude parametric values. Second, we refine our estimate of the parametric values based on the groups thus formed. Third, we form new groups by splitting and merging existing groups based on the newly estimated values. This paper applies the above strategy to Chinese documents whose complexity derives from the coexistence of horizontal and vertical textlines. Successful results are obtained using this approach. The accuracy rates for identifying text regions and textlines are above 98% in a test database consisting of over one thousand document samples and various layout structures.
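The three operations can be illustrated with a minimal sketch. It is a one-dimensional simplification, not the paper's implementation: character boxes are reduced to `(start, end)` intervals along a single reading direction, the refined spacing estimate is taken to be the median within-group gap, and the regrouping threshold is `factor` times that estimate. The function names, the use of the median, and the default `factor=2.0` are all assumptions made for illustration.

```python
from statistics import median


def group_by_gap(boxes, max_gap):
    """Group sorted (start, end) intervals; a gap above max_gap starts a new group."""
    if not boxes:
        return []
    groups = [[boxes[0]]]
    for box in boxes[1:]:
        if box[0] - groups[-1][-1][1] > max_gap:
            groups.append([box])
        else:
            groups[-1].append(box)
    return groups


def estimate_gap(groups):
    """Median gap between neighbouring boxes inside the current groups (assumed estimator)."""
    gaps = [b[0] - a[1] for g in groups for a, b in zip(g, g[1:])]
    return median(gaps) if gaps else None


def regroup(groups, max_gap):
    """Split groups at gaps above max_gap, then merge neighbours whose gap is within it."""
    pieces = []
    for g in groups:
        pieces.extend(group_by_gap(g, max_gap))  # split step
    if not pieces:
        return []
    merged = [pieces[0]]
    for g in pieces[1:]:
        if g[0][0] - merged[-1][-1][1] <= max_gap:  # merge step
            merged[-1].extend(g)
        else:
            merged.append(g)
    return merged


def adaptive_regroup(boxes, crude_gap, factor=2.0, rounds=3):
    """Crude grouping, then repeated estimate-and-regroup until the groups stop changing."""
    groups = group_by_gap(sorted(boxes), crude_gap)  # operation 1: crude grouping
    for _ in range(rounds):
        gap = estimate_gap(groups)  # operation 2: refine the spacing estimate
        if gap is None:
            break  # no group has two boxes, so there is nothing to estimate from
        new_groups = regroup(groups, factor * gap)  # operation 3: split and merge
        if new_groups == groups:
            break
        groups = new_groups
    return groups


# Six boxes of width 10: gaps of 2 inside each run, one gap of 12 between runs.
boxes = [(0, 10), (12, 22), (24, 34), (46, 56), (58, 68), (70, 80)]
lines = adaptive_regroup(boxes, crude_gap=15)  # crude value is too loose on purpose
print(len(lines), [len(g) for g in lines])
```

In this example the deliberately loose crude threshold first lumps all six boxes into one group; the refined estimate then splits it at the wide gap. If the crude threshold is so tight that every box ends up alone, the sketch has no within-group gaps to learn from and returns the crude grouping unchanged, which is one reason a real system would need a more careful initial estimate.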
Similar resources
Distributed Autonomous Agents for Chinese Document Images Segmentation
In Chinese document image processing, text and/or graphical block detection serves as an essential step in document layout analysis that in turn permits the effective reasoning about the logical relationships among various text paragraphs and graphical entities for the purpose of document understanding. This paper presents a novel computational paradigm for extracting text/graphic blocks from Ch...
Based on Adaptive Split-and-Merge and Qualitative Spatial Reasoning
The ultimate goal of automatic document processing is to understand the semantics of a document. Towards such an end, one of the primary enabling steps has been to first reason about the layout of the document by means of page segmentation and segment spatial reasoning or labeling. This, in turn, allows for the derivation of document logical organization. This paper describes a generic document s...
Multi-view HAC for Semi-supervised Document Image Classification
This paper presents a semi-supervised document image classification system that aims to be integrated into a commercial document reading software. The system is intended as an annotation aid. From a set of unknown document images supplied by a human operator, the system computes hypotheses for regrouping images with the same physical layout and proposes them to the operator. The operator can then correct them, va...
Word Spotting in Chinese Document Images without Layout Analysis
An approach to searching user-specified words/phrases in Chinese document images, without the requirements of layout analysis, is proposed in this paper. Bounding boxes of Chinese character images are first determined using connected component analysis. Next, a suitable character from the user-specified word/phrase is chosen as the initial character to search for a matching candidate in the doc...
Selective CRLA based Layout Analysis and Text Region Extraction from Low Quality Document Images
This paper aims at detecting textual regions by separating graphical regions using a Selective CRLA scheme and statistical textual properties on noise-infected, low-resolution newspaper images. A bottom-up approach is adopted: the Selective Constrained Run Length Algorithm (CRLA) is applied to obtain the layouts, and a region-growing method applied over it segments the homogeneous regions. Statistical...
Journal title:
- Pattern Recognition
Volume 38
Publication date: 2005